Population stratification occurs when the study population comprises several different subpopulations that differ in both genetic ancestry and the phenotype of interest. As a result, spurious associations can arise from differences in genetic ancestry rather than from true associations of alleles with the phenotype. Principal component analysis (PCA) can be used to identify population outliers by performing a PCA on a reference panel such as 1000 Genomes and projecting the samples of interest onto the resulting space.
To perform a PCA in PLINK, genotype data and pedigree information for the 1000 Genomes reference panel first need to be downloaded (http://www.internationalgenome.org/data) and converted into PLINK format. Two additional files are required: one listing the FID, IID and population for each sample in the reference dataset, and a second listing the population clusters in the reference data. The alleles in the sample dataset need to be aligned to the same DNA strand as the reference dataset so that the datasets can be merged correctly. Both datasets should be LD pruned to remove much of the redundancy in the data and reduce the influence of potential chromosomal artifacts, and related samples should be excluded. The following command for performing the PCA should be entered at the shell prompt to generate two files containing the principal component eigenvalues and the principal component eigenvectors.
plink_1.9 --bfile merged-reference-sample-data \
--pca 10 --within sample_population.txt \
--pca-clusters population_clusters.txt \
--out pca.output
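With `--out pca.output`, PLINK writes the eigenvalues to `pca.output.eigenval` and the eigenvectors to `pca.output.eigenvec`. The sketch below shows one way these files could be read for downstream plotting. It assumes the default PLINK 1.9 layout (one eigenvalue per line; whitespace-delimited FID, IID and then one column per principal component), and the function names `parse_eigenvec` and `parse_eigenval` are illustrative rather than part of any library.

```python
from typing import Dict, Iterable, List, Tuple


def parse_eigenvec(lines: Iterable[str]) -> Dict[Tuple[str, str], List[float]]:
    """Map (FID, IID) to that sample's principal component scores.

    Assumes each row is: FID IID PC1 PC2 ... (whitespace-delimited).
    A header row starting with FID or #FID is skipped if present.
    """
    scores: Dict[Tuple[str, str], List[float]] = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in ("FID", "#FID"):
            continue  # skip blank lines and an optional header row
        fid, iid = fields[0], fields[1]
        scores[(fid, iid)] = [float(value) for value in fields[2:]]
    return scores


def parse_eigenval(lines: Iterable[str]) -> List[float]:
    """Return the eigenvalues, assuming one value per non-blank line."""
    return [float(line) for line in lines if line.strip()]


# Typical use with the files produced by the command above:
#   with open("pca.output.eigenvec") as handle:
#       scores = parse_eigenvec(handle)
#   with open("pca.output.eigenval") as handle:
#       eigenvalues = parse_eigenval(handle)
```

Both functions accept any iterable of lines, so they work with an open file handle or with a list of strings when testing.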
The scree plot below shows the proportion of variance explained (PVE) by each principal component (Fig. 1, left) and the cumulative proportion of variance explained (Fig. 1, right). The number of principal components to include in the analysis can be chosen as the smallest number of components that together account for 95% of the variance.
Fig. 1: Scree plot of PVE for 10 principal components (left) and cumulative PVE for 10 principal components (right)
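The quantities shown in the scree plot can be computed directly from the eigenvalues. The following is a minimal sketch, not the code used to produce the figure; the function names and the example eigenvalues are made up for illustration. Note that when only the top 10 components are requested, the PVE obtained this way is relative to the variance captured by those 10 components rather than the total variance in the genotype data.

```python
from typing import List, Sequence


def proportion_variance_explained(eigenvalues: Sequence[float]) -> List[float]:
    """PVE of each component, relative to the sum of the supplied eigenvalues."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def cumulative_sum(values: Sequence[float]) -> List[float]:
    """Running total of the values, in order."""
    running = 0.0
    result = []
    for value in values:
        running += value
        result.append(running)
    return result


def components_for_threshold(eigenvalues: Sequence[float], threshold: float = 0.95) -> int:
    """Smallest number of leading components whose cumulative PVE reaches the threshold."""
    cumulative = cumulative_sum(proportion_variance_explained(eigenvalues))
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues, chosen only to keep the arithmetic easy to follow.
example = [8.0, 4.0, 2.0, 1.0, 1.0]
pve = proportion_variance_explained(example)   # [0.5, 0.25, 0.125, 0.0625, 0.0625]
cumulative_pve = cumulative_sum(pve)           # [0.5, 0.75, 0.875, 0.9375, 1.0]
n_components = components_for_threshold(example, 0.95)  # 5
```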
The following pairs plot displays the population structure across the first 4 principal components for DIAN compared with the reference populations from 1000 Genomes.
Fig. 3: Population Structure Pairs Plots
The following plots show the population structure of DIAN (black triangles) based on the first two (Fig. 4) and three (Fig. 5) principal components compared with the reference populations from 1000 Genomes. The static plot is zoomed in on the European reference super population, composed of Utah Residents (CEPH) with Northern and Western European Ancestry (CEU), Toscani in Italia (TSI), Finnish in Finland (FIN), British in England and Scotland (GBR), and Iberian Population in Spain (IBS).
Fig. 4: Ancestry clustering based on PC 1 & 2
Fig. 5: Tridimensional plot of ancestry clustering of PC 1, 2 & 3
Individuals of non-European ancestry were identified by determining the mean and standard deviation of the 10 principal component scores for the EUR super population (Table 1). Participants in DIAN who were more than 6 SD from the EUR population mean were determined to be of non-European ancestry (Table 2).
Fig. 6: EUR population outliers on PC 1 & 2
Fig. 7: Tridimensional plot displaying population outliers on PC 1, 2, & 3
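The 6 SD rule described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code itself: it assumes a participant is flagged if they fall more than 6 SD from the EUR mean on any one of the principal components (the text does not say whether the rule is applied per component or jointly), it uses the sample standard deviation, and the function name and toy data are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def flag_outliers(
    reference_scores: Sequence[Sequence[float]],
    sample_scores: Dict[str, Sequence[float]],
    n_sd: float = 6.0,
) -> List[str]:
    """Return IDs of samples more than n_sd SDs from the reference mean on any PC.

    reference_scores: one row of PC scores per reference (e.g. EUR) individual.
    sample_scores: sample ID mapped to that sample's PC scores.
    Assumption: exceeding the limit on a single PC is enough to be flagged.
    """
    n_pcs = len(reference_scores[0])
    # Per-component mean and sample standard deviation in the reference group.
    centers = [mean([row[j] for row in reference_scores]) for j in range(n_pcs)]
    spreads = [stdev([row[j] for row in reference_scores]) for j in range(n_pcs)]

    outliers = []
    for sample_id, pcs in sample_scores.items():
        if any(abs(pcs[j] - centers[j]) > n_sd * spreads[j] for j in range(n_pcs)):
            outliers.append(sample_id)
    return outliers


# Toy data: the reference group has mean 1.0 and SD 1.0 on both components.
toy_reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
toy_samples = {
    "inside": [1.5, 0.5],    # well within 6 SD on both PCs
    "boundary": [7.0, 1.0],  # exactly 6 SD on PC1, so not flagged
    "outside": [8.0, 1.0],   # 7 SD from the mean on PC1, so flagged
}
flagged = flag_outliers(toy_reference, toy_samples)
```

Applying the limit per component is the simpler choice; a joint criterion would instead require a distance measure across all components.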